subtask decomposition
Parameterizing Non-Parametric Meta-Reinforcement Learning Tasks via Subtask Decomposition
Meta-reinforcement learning (meta-RL) techniques have demonstrated remarkable success in generalizing deep reinforcement learning across a range of tasks. Nevertheless, these methods often struggle to generalize beyond tasks with parametric variations. To overcome this challenge, we propose Subtask Decomposition and Virtual Training (SDVT), a novel meta-RL approach that decomposes each non-parametric task into a collection of elementary subtasks and parameterizes the task based on its decomposition. We employ a Gaussian mixture VAE to meta-learn the decomposition process, enabling the agent to reuse policies acquired from common subtasks. Additionally, we propose a virtual training procedure, specifically designed for non-parametric task variability, which generates hypothetical subtask compositions, thereby enhancing generalization to previously unseen subtask compositions. Our method significantly improves performance on the Meta-World ML-10 and ML-45 benchmarks, surpassing current state-of-the-art techniques.
Parameterizing Non-Parametric Meta-Reinforcement Learning Tasks via Subtask Decomposition
Meta-reinforcement learning (meta-RL) techniques have demonstrated remarkable success in generalizing deep reinforcement learning across a range of tasks. Nevertheless, these methods often struggle to generalize beyond tasks with parametric variations. To overcome this challenge, we propose Subtask Decomposition and Virtual Training (SDVT), a novel meta-RL approach that decomposes each non-parametric task into a collection of elementary subtasks and parameterizes the task based on its decomposition. We employ a Gaussian mixture VAE to meta-learn the decomposition process, enabling the agent to reuse policies acquired from common subtasks. Additionally, we propose a virtual training procedure, specifically designed for non-parametric task variability, which generates hypothetical subtask compositions, thereby enhancing generalization to previously unseen subtask compositions. Our method significantly improves performance on the Meta-World ML-10 and ML-45 benchmarks, surpassing current state-of-the-art techniques.
Exponential Speedups by Rerooting Levin Tree Search
Orseau, Laurent, Hutter, Marcus, Lelis, Levi H. S.
We are interested in tree search algorithms for deterministic domains. Tree search algorithms such as all variants of best-first search -- including A* [Hart et al., 1968], Weighted-A* (WA*), and Greedy Best-First Search (GBFS) [Doran et al., 1966] -- and variants of MCTS -- such as UCT [Kocsis and Szepesvári, 2006], AlphaGo, AlphaZero and other variants [Silver et al., 2016, 2017b,a] -- explore the search tree starting at the root and can visit a node only if its parent has been visited first. These algorithms are often guided by some side information, such as with cost-to-go heuristic function for A*, WA* and GBFS, a reward/value function for UCT and AlphaZero, or a policy for AlphaZero, Levin Tree Search (LTS) [Orseau et al., 2018], and Policy-Guided Heuristic Search [Orseau and Lelis, 2021]. Such algorithms also sometimes come with different types of guarantees: A* and WA*, with an admissible heuristic function -- i.e., a function that never overestimates the optimal costto-go -- are guaranteed to return a solution that is cost-optimal (A*) or bounded-suboptimal (WA*), while UCT and AlphaZero are guaranteed to (eventually) have low regret in terms of cumulative reward during the search. LTS is guaranteed to return a solution within a number of search steps that depends on the quality of its policy. In this paper, we consider the latter type of guarantee, on the efficiency of the search process depending on the quality of the side information. To explain the main concepts of this paper, let us consider a kind of side information we call clues: some nodes are clue nodes, and a node can be known to be a clue node only when reaching it. A clue may be helpful if it is on the path toward a solution node, or misleading otherwise. The following example describes a minimalistic clue environment.